Point and interval estimates
NBIS, SciLifeLab
April 23, 2024
Unknown population parameters can be inferred from a random sample drawn from the population of interest. The sample estimate is our best guess, a point estimate, of the population parameter.
The sample proportion and sample mean are unbiased estimates of the population proportion and population mean.
The expected value of an unbiased point estimate is the population parameter that it estimates.
The sample estimate is our best guess, but it will not be without error.
To show the uncertainty, an interval estimate for a population parameter can be computed from the sample data, instead of just a point estimate.
An interval estimate is an interval of possible values that with high probability contains the true population parameter.
The width of the interval estimate can be determined from the sampling distribution.
If the sampling distribution of the sample statistic of interest is unknown, a bootstrap interval can be computed instead.
The bootstrap uses the data we have (our sample) and repeatedly samples with replacement from this sample.
Put the entire sample in an urn and resample!
Pollen example
If we are interested in how large a proportion of the Uppsala population is allergic to pollen, we can investigate this by studying a random sample. We randomly select 100 persons in Uppsala and observe that 42 have a pollen allergy.
Based on this observation our point estimate of the Uppsala population proportion \(\pi\) is \(\pi \approx p = 0.42\).
Sample from the urn with replacement to compute the bootstrap distribution.
Draw an object from the urn with replacement 100 times and note the proportion allergic (black balls).
Repeat this many times to get a bootstrap distribution.
Using the bootstrap distribution the uncertainty of our estimate of \(\pi\) can be estimated.
The 95% bootstrap interval is [0.32, 0.52].
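The urn procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the number of resamples (10,000), the seed, the helper name bootstrap_proportions and the use of a simple percentile interval are all assumptions. The resulting interval should land close to the one quoted above, but it varies slightly with the seed.

```python
import random

random.seed(1)  # arbitrary seed, only to make the sketch reproducible

# The observed sample: 42 allergic (1) and 58 non-allergic (0) persons.
sample = [1] * 42 + [0] * 58


def bootstrap_proportions(data, n_resamples=10_000):
    """Resample `data` with replacement; return the proportion in each resample."""
    n = len(data)
    return [sum(random.choices(data, k=n)) / n for _ in range(n_resamples)]


props = sorted(bootstrap_proportions(sample))

# Percentile interval: cut off 2.5% of the bootstrap distribution in each tail.
lower = props[int(0.025 * len(props))]
upper = props[int(0.975 * len(props)) - 1]
print(f"95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```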
The bootstrap is very useful if we do not know the distribution of our sampled property. But in our example we actually do.
A confidence interval is a type of interval estimate associated with a confidence level.
An interval that with probability \(1 - \alpha\) covers the population parameter \(\theta\) is called a confidence interval for \(\theta\) with confidence level \(1 - \alpha\).
\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]
If \(\sigma\) is known
\[Z = \frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1)\]
\[P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) = 1-\alpha\]
\(z_{\alpha/2}\) is the value such that \(P(Z \geq z_{\alpha/2}) = \frac{\alpha}{2} \iff P(Z \leq z_{\alpha/2}) = 1 - \frac{\alpha}{2}\).
For 95% confidence, \(\alpha = 0.05\) and \(z_{\alpha/2} = z_{0.025} = 1.96\). For 90% and 99% confidence, \(z_{0.05} = 1.64\) and \(z_{0.005}=2.58\), respectively.
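The critical values can be checked numerically. The sketch below uses Python's standard library (statistics.NormalDist); the helper name z_critical is made up for this illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Return z_{alpha/2}, the value with P(Z <= z) = 1 - alpha/2."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.2f}")
```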
From the standard normal distribution we know that
\[P(-z_{\alpha/2}<Z<z_{\alpha/2}) = 1-\alpha\]
\[P(-z_{\alpha/2}<\frac{\bar X-\mu}{SEM}<z_{\alpha/2}) = 1-\alpha\]
\[P(\mu-z_{\alpha/2}SEM<\bar X<\mu+z_{\alpha/2}SEM) = 1-\alpha\]
\[P(\bar X-z_{\alpha/2}SEM<\mu<\bar X+z_{\alpha/2}SEM) = 1-\alpha\]
The confidence interval with confidence level \(1-\alpha\) is
\[[\bar x_{obs} - z_{\alpha/2}SEM, \bar x_{obs} + z_{\alpha/2}SEM]\]
or
\[\mu = \bar x_{obs} \pm z_{\alpha/2}SEM\] where \(SEM = \frac{\sigma}{\sqrt{n}}\).
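As a small illustration of the formula, the following Python sketch computes the interval when \(\sigma\) is known. The function name z_interval and the numbers passed to it (\(\bar x_{obs} = 10\), \(\sigma = 2\), \(n = 16\)) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    sem = sigma / sqrt(n)  # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return x_bar - z * sem, x_bar + z * sem


# Hypothetical values, not taken from the slides.
low, high = z_interval(x_bar=10.0, sigma=2.0, n=16)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

Here SEM is 2/4 = 0.5, so the margin is about 1.96 times 0.5.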
The mean of a sample of \(n\) independent and identically normally distributed observations \(X_i\) is normally distributed:
\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]
If \(\sigma\) is unknown and \(n\) is small?
Use the statistic \(t=\frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} \sim t(n-1)\), t-distributed with \(n-1\) degrees of freedom.
It follows that
\[ \begin{aligned} &P\left(-t < \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} < t\right) = 1 - \alpha \iff \\ &P\left(\bar X - t \frac{s}{\sqrt{n}} < \mu < \bar X + t \frac{s}{\sqrt{n}}\right) = 1 - \alpha \end{aligned} \]
The confidence interval with confidence level \(1-\alpha\) is thus
\[[\bar x_{obs} - t \frac{s}{\sqrt{n}}, \bar x_{obs} + t \frac{s}{\sqrt{n}}]\]
or
\[\mu = \bar x_{obs} \pm t \frac{s}{\sqrt{n}}\]
For a 95% confidence interval and \(n=5\), that is \(n-1=4\) degrees of freedom, \(t = 2.7764\).
The \(t\) values for different values of \(\alpha\) and degrees of freedom are tabulated and can be computed in R using the function qt.
You study the BMI of male diabetic patients. In a sample of size 6 you observe \(27, 25, 31, 29, 30, 22\). Assume that the BMI is normally distributed and calculate a 95% confidence interval for the mean BMI in male diabetic patients.
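One way to work through this exercise is sketched below in Python. The critical value 2.5706 is the tabulated \(t\) value for a 95% interval with \(n-1=5\) degrees of freedom, hard-coded here as an assumption because the Python standard library has no t-distribution; in R it could be obtained with qt(0.975, df = 5).

```python
from math import sqrt
from statistics import mean, stdev

bmi = [27, 25, 31, 29, 30, 22]

n = len(bmi)
x_bar = mean(bmi)   # sample mean
s = stdev(bmi)      # sample standard deviation (divides by n - 1)
sem = s / sqrt(n)   # estimated standard error of the mean

# Tabulated t value for alpha = 0.05 and n - 1 = 5 degrees of freedom
# (assumed from a t table, not computed here).
t_crit = 2.5706

margin = t_crit * sem
ci = (x_bar - margin, x_bar + margin)
print(f"mean = {x_bar:.2f}, s = {s:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```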
Remember that we can use the central limit theorem to show that
\[P \sim N\left(\pi, SE\right) \iff P \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)\]
It follows that
\[Z = \frac{P - \pi}{SE} \sim N(0,1)\] Based on what we know of the standard normal distribution, we can compute an interval around the population property \(\pi\) such that the probability that a sample property \(p\) falls within this interval is \(1-\alpha\).
\[P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) = 1-\alpha\]
\[P\left(-z_{\alpha/2} < \frac{P - \pi}{SE} < z_{\alpha/2}\right) = 1 - \alpha\]
We can rewrite this to
\[P\left(\pi-z_{\alpha/2} SE < P < \pi + z_{\alpha/2} SE\right) = 1-\alpha\] In words, a sample fraction \(p\) will fall between \(\pi \pm z_{\alpha/2} SE\) with probability \(1- \alpha\).
The equation can also be rewritten to
\[P\left(P-z_{\alpha/2} SE < \pi < P + z_{\alpha/2} SE\right) = 1 - \alpha\]
The observed confidence interval is what we get when we replace the random variable \(P\) with our observed fraction \(p\), which also means estimating \(SE\) from \(p\):
\[p-z_{\alpha/2} SE < \pi < p + z_{\alpha/2} SE\] \[\pi = p \pm z_{\alpha/2} SE = p \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}\]
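The 95% confidence interval is \[\pi = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]
A 95% confidence interval has a 95% chance of covering the true value: if the sampling were repeated many times, about 95% of the intervals computed this way would contain \(\pi\).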
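The coverage statement can be illustrated with a small simulation. The Python sketch below assumes a true proportion of 0.42, a sample size of 100 and 2,000 repeated studies; these settings and the seed are arbitrary choices for illustration. Because the interval relies on a normal approximation, the simulated coverage is close to, but not exactly, 95%.

```python
import random
from math import sqrt

random.seed(2)    # arbitrary seed for reproducibility

true_pi = 0.42    # assumed true population proportion
n = 100           # sample size in each simulated study
n_studies = 2000  # number of repeated studies
z = 1.96          # z_{alpha/2} for 95% confidence

covered = 0
for _ in range(n_studies):
    # Draw one sample of size n and compute the observed proportion.
    successes = sum(random.random() < true_pi for _ in range(n))
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    # Count the study if its interval contains the true proportion.
    if p - z * se < true_pi < p + z * se:
        covered += 1

coverage = covered / n_studies
print(f"Proportion of intervals covering the true value: {coverage:.3f}")
```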
Back to our example of the proportion of pollen allergic persons in Uppsala: \(p=0.42\) and \(SE=\sqrt{\frac{p(1-p)}{n}} = 0.0494\).
Hence, the 95% confidence interval is \[\pi = 0.42 \pm 1.96 \cdot 0.0494 = 0.42 \pm 0.097\] or \[(0.42-0.097, 0.42+0.097) = (0.32, 0.52)\]
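The arithmetic for the pollen example can be reproduced with a few lines of Python. This is only a check of the numbers in the example, using the normal approximation interval.

```python
from math import sqrt

p = 0.42  # observed proportion allergic
n = 100   # sample size
z = 1.96  # z_{alpha/2} for 95% confidence

se = sqrt(p * (1 - p) / n)  # estimated standard error
margin = z * se
lower, upper = p - margin, p + margin

print(f"SE = {se:.4f}, margin = {margin:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```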